-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new benchmark MAIR #1425
base: main
Are you sure you want to change the base?
Add new benchmark MAIR #1425
Conversation
Following the above process, I am currently open a pull request in https://github.com/embeddings-benchmark/results to submit our evaluation results to the newly-added MAIR benchmark. |
When you run your tasks, MTEB will generate a folder with results from your runs, and you can submit that folder |
return | ||
self.corpus, self.queries, self.relevant_docs = {}, {}, {} | ||
queries_path = self.metadata_dict["dataset"]["path"] | ||
docs_path = self.metadata_dict["dataset"]["path"].replace("-Queries", "-Docs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you place queries and docs in same repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. To keep Q/D in one repo, I could create a separate repo for each task.
But is that necessary? I think having two repo for Q and D would be easier to manage than having over hundreds of repo for each task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can create different splits for queries and documents in the same repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see.
One issue is that the data in MAIR has a two-level structure: task → subtasks, as some tasks contain multiple subtasks (e.g., IFEval, SWE-Bench). It’s tricky to maintain this structure without flattening it into a single repository.
And I think people may not need to download all the data if they’re only interested in evaluating a few specific tasks. So, if we still need to put Q and D in a single repo, the best way might be to generate 126 separate repo (for each task).
Or do you have any other suggestions?
Another question is that our benchmark has two settings, i.e., evaluting the model with and without instruction. Should I store the result with instruction and without instruction into two files, respectively? |
If you have results for both instruct and non-instruct, it might be better to create separate tasks, though @orionw might have a clearer perspective on this |
+1 to adding a duplicate task if you have a specific instruction you want them to use for each. Otherwise models can define their own instructions and in that case you could just submit results to the same task but with a different prompt in the meta info. If you’re adding an instruction variant (and once #1359 is in) you’d just need to add a version of those tasks with all the same attributes but also a config/attribute called “self.instruction” (query-id -> instruction_text) format |
Hi. If we have duplicate tasks with different instructions, will they appear in separate tables on the leaderboard? Like would there be a one called (XXX with instruction) and another called (XXX without instruction)? |
@sunnweiwei they would appear as different datasets yes. So you could have one leaderboard with them and one without, if desired. Or push it all together into one benchmark. Does that answer the question? Or do you mean more than one instruction per dataset? |
Thanks for the answer! I was thinking to put them into one table for benchmarking purpose, maybe adding a column to indicate if instructions were used. Then people could compare models with and without instructions in the same table. Good to know we can do this then. |
Fixes #1426
Checklist
make test
.make lint
.Adding datasets checklist
Reason for dataset addition: ...
The added data is introduced in https://arxiv.org/abs/2410.10127, which introduces a benchmark for instructable information retrieval. It contains 126 real-world retrieval tasks across 6 domains, with instructions manually annotated. And the data has been sampled to reduce evaluation costs.
mteb -m {model_name} -t {task_name}
command.sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
self.stratified_subsampling() under dataset_transform()
make test
.make lint
.Adding a model checklist
mteb.get_model(model_name, revision)
andmteb.get_model_meta(model_name, revision)